(57) ABSTRACT

A computer implemented method of converting a document 
in an input format to a document in a different output format 
is disclosed. The method generally comprises locating data 
in the input document, grouping data into one or more
intermediate format blocks in an intermediate format
document, and converting the intermediate format document
to the output format document using the intermediate format
blocks. Each intermediate format block may be a paragraph,
a line, a word, a table, or an image. The input document may 
be received over a network and the output document is sent 
over the network. A linked table of contents and/or an index 
may be generated. A computer executable program may be 
generated and inserted into the output document for select- 
ing one output format for display. The output document may 
be displayed by locating sub-page breaks in the document, 
subdividing the document into sub-pages using the sub-page
breaks, locating blocks within each sub-page, and sequen- 
tially displaying all or a portion of each block of the 
sub-pages within display parameters of a display configu- 
ration. Tables may be divided to be displayed in more than
one display page. The converter may be incorporated in a
computer program product for maintaining a repository of
input documents in one or more storage formats. 
For the latest full-length data sheet, please refer to the Micron
Web site: www. mkrvn.ajnVm&'msp/html/datasheeL html 

FEATURES 

• Single +3.3V ±0.3 V power supply 

• Industry-standard x8 pinout, timing, functions and 
packages 

• 13 row, 10 column addresses (E1) or 12 row, 11 column
addresses (B6) 

• High-performance CMOS silicon-gate process 

• All inputs, outputs and clocks are LVTTL-compatible

• FAST PAGE MODE (FPM) access

• 4,096-cycle CAS#-BEFORE-RAS# (CBR) REFRESH 
distributed across 64ms 



FIG. 21 A 



• Optional self refresh (S) for low-power data retention 
GENERAL DESCRIPTION

The MT4LC8M8E1 and MT4LC8M8B6 are high-speed
CMOS dynamic random-access memory devices
containing 67,108,864 bits organized in a x8 configuration.
The MT4LC8M8E1 and MT4LC8M8B6 are functionally
organized as 8,388,608 locations containing eight bits each.
The 8,388,608 memory locations are arranged in 8,192 rows
by 1,024 columns for the MT4LC8M8E1 or 4,096 rows by
2,048 columns for the MT4LC8M8B6. During READ or
WRITE cycles, each location is uniquely addressed via the
address bits. First, the row address is latched by the RAS#
signal, then the column address by CAS#. Both devices
provide FAST-PAGE-MODE operation, allowing for fast
successive data operations (READ, WRITE or
READ-MODIFY-WRITE) within a given row.

The MT4LC8M8E1 and MT4LC8M8B6 must be refreshed 
periodically in order to retain stored data. 
CONVERSION DATA REPRESENTING A DOCUMENT TO OTHER FORMATS FOR
MANIPULATION AND DISPLAY

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent
Application Ser. No. 60/102,688 filed on Oct. 1, 1998 and entitled
"Processor-Based Method for Converting and Outputting Digital Data
Representing a Document Image," the entirety of which is
incorporated by reference herein.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to a method for converting a
document stored in one format to a different format. More
specifically, a system and method for converting digital data
representing an image of a document image stored in one format to
other formats for manipulation and display are disclosed.

2. Description of the Related Art

Automatic processing of digital data representing an image of a
document using a digital computer to recognize, capture and/or store
information contained in the document has been the subject of active
research and commercial products. For example, U.S. Pat. No.
5,737,442 issued on Apr. 7, 1998 to H. Alam discloses a processor
based method for recognizing, capturing and storing tabular data
from digital computer data representing a document, the disclosure
of which is incorporated herein by reference in its entirety.

However, many other image processing research and products have not
focused on the accurate, efficient and automatic capturing of the
information contained in a document and converting the document to a
different format to be displayed, for example. Nor have other image
processing research and products focused on allowing the user to
manually or otherwise reformat and/or revise the contents of the
document. Further, such image processing research and products have
also not focused on the conversion of such information to a format
that a user may easily manipulate in order to utilize all or a
portion of the information contained in the document and/or to
reformat the document as desired into a different layout. For
example, it may be desirable for the user to manipulate the document
by cutting, pasting and/or otherwise editing or revising the
document to reformat and/or to fully or partially utilize the
information contained in the document such as for analysis and/or
other uses.

What is needed are accurate and efficient systems and methods for
converting a document stored in one format to a different format.
Such systems and methods preferably convert digital data
representing an image of a document image stored in one format to
other formats for manipulation and display, for example.

SUMMARY OF THE INVENTION

The present invention comprises a method for extracting data from
digital data representing a document, such as a printed document or
an Internet webpage. The method generally comprises locating words
from the digital data of the document in the original or input
format, joining the located words into lines, joining the lines into
paragraphs, locating tables from the joined paragraphs, converting
the paragraphs and tables to an intermediate format, and outputting
the information into an output format. The input and output formats
may be, for example, portable document format (PDF), rich text
format (RTF), hypertext markup language (HTML) format with style
sheets, tabular HTML, extensible markup language (XML), cascading
style sheets (CSS), Netscape Layers, linked and separate pages, Tag
Image File Format (TIFF) or any other image format such as graphics
interchange format (GIF), bit map (BMP), or Joint Photographic
Experts Group (JPEG), formats generated by text and/or image
authoring tools or applications, or any other suitable formats.

A computer implemented method of converting a document in an input
format to a document in a different output format is disclosed. The
method generally comprises locating data in the input document,
grouping data into one or more intermediate format blocks in an
intermediate format document, and converting the intermediate format
document to the output format document using the intermediate format
blocks. Preferably, the grouping includes locating words in the
input document, joining words satisfying a line threshold into
lines, joining lines satisfying a paragraph threshold into
paragraphs, and locating tables. The grouping may alternatively or
further include locating tags (or control characters) in the input
document and utilizing the tags in locating words, joining words
into lines, joining lines into paragraphs, and locating tables. Each
intermediate format block may be selected from a word, a line, a
paragraph, a table, and an image.

Each of the input format and output format may be in portable
document format (PDF), rich text format (RTF), hypertext markup
language (HTML), extensible markup language (XML), cascading style
sheets (CSS), Netscape Layers, linked and separate pages, Tag Image
File Format (TIFF), graphics interchange format (GIF), bit map
(BMP), Joint Photographic Experts Group (JPEG), MICROSOFT WORD™,
WORD PERFECT™, AUTOCAD™, and POWER POINT™.

In one embodiment, the input document is received over a network and
the output document is sent over the network. The network may be the
Internet or an intranet, and the documents may be transferred, for
example, via electronic mail. Headings of the input document may be
located to generate a linked table of contents page containing the
headings, each table of contents heading containing a link to the
heading contained in the output document, the table of contents page
being placed into the output document.

In another embodiment, a computer executable program, such as a
JAVA™ script, may be generated for selecting one output format for
display, the program being inserted into the output document.

The methods of the present invention may be implemented by computer
codes stored on a computer readable medium such as CD-ROM, zip disk,
floppy disk, tape, flash memory, system memory, hard drive, and data
signal embodied in a carrier wave.

The output document, for example, may be displayed by locating
sub-page breaks in the document, subdividing the document into
sub-pages using the sub-page breaks, locating blocks within each
sub-page, and sequentially displaying all or a portion of each block
of the sub-pages within display parameters of a display
configuration. Tables may be divided to be displayed in more than
one display page. A linked table of contents and/or a linked index
may also be generated.

In another embodiment, the converter may be incorporated in a
computer program product for maintaining a repository of input
documents in one or more storage formats. A table of contents and/or
an index may also be generated.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example of a computer system that can be
utilized to execute software of an embodiment of the present
invention;

FIG. 2 is a system block diagram of the computer system of FIG. 1;

FIG. 3 is a flow diagram illustrating the overall method for
converting data representing a document from an original input
format to a different output format;

FIG. 4 is a flow diagram illustrating a step of converting the input
data to a different output format;

FIG. 5 is a schematic illustrating conversion of data representing a
document to a different output format;

FIG. 6 is a schematic illustrating conversion of data representing a
document to portable document format, to an intermediate format, and
finally to a different output format;

FIG. 7 shows a flow diagram illustrating extracting data from an
image of a document to convert the data to the intermediate format;

FIG. 8 shows a flow diagram illustrating the process steps for
joining words into lines;

FIG. 9 shows a portion of a sample document illustrating the various
criteria used for joining words and lines;

FIG. 10 shows a flow diagram illustrating the processing steps for
joining the lines into paragraphs;

FIG. 11 shows a flow diagram illustrating the process for converting
the document stored in an intermediate format to an output format;

FIG. 12 shows a flow diagram illustrating the conversion from an
intermediate format to RTF or HTML with style sheets output format;

FIG. 13 shows a flow diagram illustrating the conversion from an
intermediate format to TIFF output format (or other image formats);

FIG. 14 shows a flow diagram illustrating a first portion of the
conversion from an intermediate format to tabular HTML output
format;

FIG. 15A shows a page of a sample document illustrating intermediate
format blocks;

FIGS. 15B and 15C illustrate division of the sample document page of
FIG. 15A into cells of a macro table;

FIG. 16 shows a flow diagram illustrating a second portion of the
conversion from an intermediate format to the tabular HTML output
format;

FIG. 17 shows a page of a sample document illustrating the
partitioning of a non-divisible cell of a macro table to generate a
highest common factor coordinate table for placement of each block
within the cell at a corresponding coordinate within the coordinate
table;

FIG. 18 shows a flow diagram of a process for reformatting a
document into display pages for display on a differently configured
display;

FIG. 19 shows a flow diagram illustrating dividing a current block
into portions such that each portion is within the display parameter
of the display configuration and for displaying the portions of the
current block;

FIG. 20 shows a sample document having sub-page breaks and tables;

FIGS. 21A-F show five display pages into which the sample document
of FIG. 20 may be divided;

FIG. 22 shows a sample table which may be contained in a document;

FIGS. 23A and 23B show sample display pages by which the table shown
in FIG. 22 may be displayed;

FIG. 24 shows a schematic of a system over which a service for
converting data representing a document may be provided over a
network;

FIG. 25 shows a flow diagram illustrating a service for converting
data representing a document over a network;

FIG. 26 shows a flow diagram illustrating a process for maintaining
a knowledge base or document repository using a uniform storage
format; and

FIG. 27 shows a schematic of a system in which the knowledge base or
document repository using a uniform storage format may be used.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention comprises systems and methods for converting
digital data representing an image of a printed document in an
original or input format to a different output format. The following
description is presented to enable any person skilled in the art to
make and use the invention. Descriptions of specific applications
are provided only as examples. Various modifications to the
preferred embodiment will be readily apparent to those skilled in
the art, and the general principles defined herein may be applied to
other embodiments and applications without departing from the spirit
and scope of the invention. Thus, the present invention is not
intended to be limited to the embodiments shown, but is to be
accorded the widest scope consistent with the principles and
features disclosed herein.

FIG. 1 illustrates an example of a computer system 101 that can be
used to execute the software of an embodiment of the invention. FIG.
1 shows a computer system 101 that includes a display 103, screen
105, cabinet 107, keyboard 109, and mouse 111. Mouse 111 can have
one or more buttons for interacting with a graphical user interface.
Cabinet 107 houses a CD-ROM, zip, and/or floppy disk drive 113,
system memory and a hard drive (see FIG. 2) which can be utilized to
store and retrieve software programs incorporating computer code
that implements the invention, data for use with the invention, and
the like. Although CD-ROM, zip, and floppy disks 115 are shown as
exemplary computer readable storage mediums, other computer readable
storage media including tape, flash memory, system memory, and hard
drive can be utilized. Additionally, a data signal embodied in a
carrier wave, such as in a network including the Internet or an
intranet, can be the computer readable storage medium.

FIG. 2 is a system block diagram of computer system 101 used to
execute the software of an embodiment of the invention. As in FIG.
1, computer system 101 includes monitor 103, keyboard 109, and mouse
111. Computer system 101 further includes subsystems such as a
central processor 151, system memory 153, fixed storage 155 (such as
a hard drive and random access memory), removable storage 157 (such
as a CD-ROM, zip or floppy drive), display adapter 159, sound card
161, speakers 163, network interface 165, and printer, facsimile,
and/or scanner interface 167. Other computer systems suitable for
use with the invention can include additional or fewer subsystems.
For example, another computer system could include more than one
processor 151 (such as a multi-processor system) or a cache memory.

The system bus architecture of computer system 101 is represented by
arrows 169. However, these arrows are illustrative of any
interconnection scheme serving to link the
subsystems. For example, a local bus could be utilized to connect
the central processor to the system memory and display adapter.
Computer system 101 shown in FIG. 2 is but an example of a computer
system suitable for use with the invention. Other computer
architectures having different configurations of subsystems can also
be utilized.

FIG. 3 is a flow diagram illustrating a method 300 for converting
data representing a document from an original input format to a
different output format. Conversion method 300 includes receiving
input data at step 302. The step of receiving input data may be
achieved by receiving or reading data from a computer readable
storage medium, such as those listed above, including CD-ROM, zip,
floppy disk, tape, flash memory, system memory, hard drive, and data
signal embodied in a carrier wave. The data signal embodied in a
carrier wave may be a carrier wave in a network including the
Internet or an intranet, or a carrier wave delivered via a computer
port, such as a parallel, serial, or Universal Serial Bus (USB)
printer port, including a data signal delivered via a facsimile
machine and/or a scanner.

Method 300 then determines whether the input data is in a format
supported as an input format at step 304. The supported input
formats are preferably the same as those available as output
formats, although the input formats may include fewer, more, or any
combination or subset of output formats. For example, in certain
circumstances, it may be desirable to support or allow many
different input formats while allowing only one specific output
format. The supported input and/or output formats may include one or
more versions of HTML, XML, PDF, RTF, CSS, Netscape Layers, linked
and separate pages, Tag Image File Format (TIFF) or any other image
format, formats generated by text and/or image authoring tools or
applications, or any other suitable formats.

If at step 304 it is determined that the input data is in a format
supported as an input format, then the input data is converted to
one or more output formats different from the original input format
at a step 306. The one or more output formats may be specified by
the user, may be all of the output formats supported by method 300,
and/or may be determined based upon the application or device to
which the converted data is output. For example, the output device
may be a portable digital assistant (PDA) which supports one or more
of the output formats supported by method 300.

Alternatively, if at step 304 it is determined that the input data
is not in a format supported as an input format, then method 300
terminates without converting the input data. Method 300 may also
output an error message indicating that the input data is not in a
format supported as an input format.

FIG. 4 is a flow diagram illustrating an embodiment of step 306 of
converting the input data to a different output format. Step 306
comprises converting the input data to an intermediate format at a
step 402. The intermediate format is then used to generate the
output data in one or more output formats at step 404.

FIGS. 5 and 6 are schematics illustrating an embodiment of
converting data representing a document to a different output
format. FIG. 5 illustrates conversion of data representing a
document to an intermediate format and then to a different output
format, and FIG. 6 illustrates conversion of data representing a
document to PDF, to an intermediate format, and finally to a
different output format.

As shown in FIG. 5, a document 502 may be scanned by a scanner 504
or a facsimile 506 may be received by a facsimile machine 508. Each
of scanner 504 and facsimile machine 508 outputs data 510
representing the image of document 502 to an optical character
recognition (OCR) application 526. OCR application software is known
in the art and is commercially available off-the-shelf. OCR
application 526 converts document image data 510 representing the
image of document 502 or facsimile 506 to a document 518 in a format
generated by the OCR application.

Alternatively, a text and/or image authoring tool 516 may be
utilized to create a text and/or image document 518. Text and/or
image authoring tool 516 may be, for example, an application such as
MICROSOFT WORD™, WORD PERFECT™, AUTOCAD™, POWER POINT™, and/or any
other suitable text and/or image authoring tools. Text/image
document 518 is output to a document converter 528 which converts
text and/or image document 518 to an intermediate format document
530. Converter 528 is in essence a document translator that may be
incorporated into, for example, a printer driver such that the data
received from sources such as a scanner or a facsimile machine may
be converted directly to a document in the intermediate format.

Intermediate format document 530 is received by a converter 532
which converts intermediate format document 530 to an output format
document 534. The output format may be, for example, HTML or XML and
output format document 534 may be output to an output application or
device, such as INTERNET EXPLORER™ or NETSCAPE™. The conversion of a
document to an intermediate format is described in more detail
below.

In another embodiment as shown in FIG. 6, document image 510 and
text and/or image document 518 are input to ACROBAT CAPTURE™
application software 612A and ACROBAT WRITER™ application software
612B, respectively, each of which outputs a PDF document 626.
Application programs ACROBAT CAPTURE™ and ACROBAT WRITER™ are
software products commercially available from Adobe Systems. PDF
document 626 is received by a converter 628 which converts the PDF
document to an intermediate format document 530. The intermediate
format document is output to converter 532 which converts
intermediate format document 530 to an output format document 534.
As noted above, the output format may be, for example, HTML or XML
and the output format document 534 may be output to an output
application or device, such as INTERNET EXPLORER™ or NETSCAPE™.

The intermediate format is preferably a format that can be easily
utilized to transfer the data representing the contents of the
documents to any other desired output format. In essence, the
intermediate format serves as a document translator. The
intermediate format document preferably includes information
including characters and their fonts (including italics), sizes,
weights (bold or normal), underlines, and locations within a
document. The intermediate format document preferably groups
characters into words, lines, paragraphs, and/or tables. Each group
is stored in the intermediate format document as an intermediate
format block. The intermediate format block may also store an image
or other grouped or blocked portion of the input document. The
intermediate format preferably also retains information on
bookmarks, document links, raster images and vector images contained
in the input document. Further, the intermediate format preferably
retains or transfers any embedded animation, sounds and/or music, as
well as the
execution of links to start up other applications. For
example, the intermediate format may be a listing of the 
intermediate format blocks along with their X and Y coor- 
dinates. 

Each intermediate format block may be an image, a
paragraph, an element in a table, or all or a portion of the 
table, depending upon the spacing of the elements of the 
table. The information stored in the intermediate format can 
be easily converted to the desired output format. 
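
As a rough illustration of the listing described above, an intermediate format document could be modeled as a list of blocks, each carrying its kind and its X and Y coordinates. The class and field names below (`Block`, `IntermediateDocument`, `kind`, `x0`, and so on) are hypothetical and do not come from the disclosure; this is a minimal Python sketch under those assumptions, not the disclosed implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Block kinds named in the text: word, line, paragraph, table, image.
BLOCK_KINDS = {"word", "line", "paragraph", "table", "image"}


@dataclass
class Block:
    """One intermediate format block (hypothetical field layout)."""
    kind: str        # one of BLOCK_KINDS
    x0: float        # X coordinate of the top left corner
    y0: float        # Y coordinate of the top left corner
    x1: float        # X coordinate of the bottom right corner
    y1: float        # Y coordinate of the bottom right corner
    text: str = ""   # left empty for image blocks

    def __post_init__(self) -> None:
        if self.kind not in BLOCK_KINDS:
            raise ValueError(f"unknown block kind: {self.kind}")


@dataclass
class IntermediateDocument:
    """A listing of blocks along with their X and Y coordinates."""
    blocks: List[Block] = field(default_factory=list)

    def listing(self) -> List[Tuple[str, float, float]]:
        # One (kind, X, Y) entry per block, in stored order.
        return [(b.kind, b.x0, b.y0) for b in self.blocks]
```

A converter to a particular output format would then only need to walk `blocks` and emit each one at its recorded position.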

The overall process of converting to the intermediate
format having been described, the details of the conversion
process will now be described. FIG. 7 shows a flow diagram
illustrating an embodiment of step 402 of extracting data
from an image of a document and converting the data to the
intermediate format. Step 402 includes locating and storing
tags in the input format document at a step 700, locating
words from the digital data at a step 702, joining the located
words into lines at step 704, joining the lines into paragraphs
at a step 706, locating tables from the joined paragraphs at
a step 708, and outputting the intermediate format data
generated from steps 702, 704, 706, and 708 at step 710.
Details of each of steps 702, 704, 706, and 708 are discussed 
in more detail below. 

Step 700: Locate and Store Tags in Input Format Document 

Text representation of documents in certain formats, such
as WORD™, may contain tags (or control characters). The
process may first recognize the input format, such as
WORD™, of the input document. If the tags of the input
document are recognizable, then dictionary tags for that
input format or type may be utilized to translate the located
tags into the intermediate format.

Each tag may be associated with a specific portion of the
document. Tags generally contain information about the
specific portion such as identification as a heading, a table,
a paragraph or a list and/or other information such as
alignment, font, etc. Step 700 thus locates and stores the
tags, if any, and the associated information contained
therein. The tags may be complete, or the tags may be
incomplete and not provide complete information
about the specific portion of the document. The tags may be
utilized to facilitate execution of subsequent steps, such as
steps 702, 704, 706 and 708. If results of such subsequent
steps conflict with the information contained in the tags, the
results from the steps preferably supersede or replace the
information in the tags. In other words, tags are preferably
used as baseline or default results or settings. Alternatively,
if the tags are complete, the process may bypass steps 702,
704, 706 and 708.
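
The precedence rule described above, in which tags serve as defaults and results of the later steps replace them on conflict, can be sketched as a simple dictionary merge. The function name `resolve_block_info` and the keys used in the example are hypothetical; the disclosure does not specify how tag information is stored.

```python
from typing import Dict, Optional


def resolve_block_info(tag_info: Optional[Dict[str, str]],
                       layout_info: Dict[str, str]) -> Dict[str, str]:
    """Combine tag information with results of the later layout steps."""
    resolved = dict(tag_info or {})  # tags act as baseline or default settings
    resolved.update(layout_info)     # later step results supersede the tags
    return resolved
```

For instance, a portion tagged as a heading with left alignment, which the layout steps identify as a paragraph, would come out as a left-aligned paragraph: the conflicting value is replaced and the non-conflicting one is kept.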

Documents in certain other formats, such as PDF, do not
contain tags. In such case, results from the subsequent steps,
such as steps 702, 704, 706 and 708, are used to obtain the
information which would otherwise be contained in the tags.
The subsequent steps utilize the layout information (i.e.,
image representation) of the text of the document to locate
words, lines, paragraphs, and tables, for example.

Step 702: Locate Words in Input Format Document

In locating words from digital data representing an image
of a document at step 702, the digital computer utilizes
information provided for each word by the digital data in an
input format. The information provided by the digital data in
the original input format may include, for example, X and Y
coordinates for the top left and bottom right of the word
relative to the page as well as the font of the word. The font
information includes information on the style, size, weight
(bold or non-bold), stroke (italics or non-italics) and
orientation of the word. For purposes of discussion only, the X
axis is assumed to extend along the width (horizontal
direction) of a page and the Y axis is assumed to extend
along the length (vertical direction) of the page, in either
portrait or landscape orientation. The individual words are
then sorted by their X and Y coordinates, preferably first by
the Y coordinate in the vertical direction and then by the X
coordinate in the horizontal direction. However, the precise
method by which the words are sorted may be varied.
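
The preferred sort order described above, first by the Y coordinate and then by the X coordinate, can be sketched as follows. The `Word` tuple and its field names are hypothetical, and the sketch assumes that Y increases down the page so that words nearer the top sort first.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x0: float  # top left X
    y0: float  # top left Y
    x1: float  # bottom right X
    y1: float  # bottom right Y


def sort_words(words: List[Word]) -> List[Word]:
    """Sort first by the top Y coordinate, then by the left X coordinate."""
    return sorted(words, key=lambda w: (w.y0, w.x0))
```

Because this key compares Y values exactly, two words on the same visual line whose tops differ slightly may not end up adjacent; the line-joining step that follows tolerates such differences through its thresholds.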
Step 704: Join Words into Lines 

FIG. 8 shows a flow diagram illustrating step 704 of 
joining the located and sorted words into lines. To join the 
located and sorted words into lines, the first word from the 
list of sorted words is assigned to a first line at step 802.
This first line may be defined as the current line. A next word 
is then picked or selected at step 804. 

A determination is made whether the selected word is in 
the current line at step 806. To determine whether the 
selected word is in the current line, the appropriate Y 
coordinate(s), i.e., in the vertical direction, of the selected
word are compared with the appropriate Y coordinate(s) of
the previous word in the current line to determine whether 
certain line parameters and/or thresholds are satisfied. For 
example, the top Y coordinate of the selected word may be 
compared with the top Y coordinate of the previous word in 
the current line to determine the inter-word spacing in the Y 
direction. If the inter-word spacing or distance in the Y 
direction is greater than a threshold of, for example, 10% of 
the average character height, then the inter-word spacing 
parameter in the Y direction is not met and the word is 
determined not to be in the current line. The average 
character height may be determined from the words in the 
current line or from all the words in the document, for 
example. Of course, other suitable comparisons and/or 
analysis may be made by step 806 to determine whether the 
selected word is in the current line. 
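
The example comparison for step 806 can be sketched as below, using the 10% figure given in the text. The function names are hypothetical, and the sketch assumes that the heights of word bounding boxes are an acceptable stand-in for character heights, which the disclosure does not state.

```python
from typing import Sequence


def average_char_height(heights: Sequence[float]) -> float:
    """Mean of the supplied heights, used here as the average character height."""
    return sum(heights) / len(heights)


def in_current_line(selected_top: float, previous_top: float,
                    avg_height: float, ratio: float = 0.10) -> bool:
    """True when the vertical offset between the two top Y coordinates
    does not exceed ratio * avg_height."""
    return abs(selected_top - previous_top) <= ratio * avg_height
```

The `ratio` parameter is exposed because the text presents 10% only as an example threshold.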

If at a step 806, it is determined that the selected word is 
not in the current line, step 808 determines whether the word 
is in any existing line, i.e., a line having at least one word 
assigned thereto. This may be determined with an analysis 
similar to those described above with reference to step 806.
For example, if an upper and/or lower Y coordinate is 
assigned to each existing line, a determination may be made 
of whether the upper and/or lower Y coordinate of the 
selected word falls within a threshold distance above or 
below the upper and/or lower Y coordinate of any other 
existing lines. The line threshold distance may be, for 
example, 10% of the average character height. Alternatively, 
a determination may be made of whether the upper and/or 
lower Y coordinates of the selected word fall within a 
threshold distance above or below the upper and/or lower Y 
coordinates of one or more words on the other existing lines. 
The comparison of the Y coordinates is repeated for each of 
the other existing lines until all of the other existing lines are 
examined or until the selected word is determined to be in 
an existing line. If it is determined that the selected word is 
in an existing line, then that existing line is defined as the 
current line at step 809. 
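
A minimal sketch of the search over existing lines follows, assuming each existing line is summarized by a single upper Y coordinate, which is one of the options the text mentions. The function name and the list-of-floats representation are hypothetical.

```python
from typing import List, Optional


def find_existing_line(word_top: float, line_tops: List[float],
                       avg_height: float, ratio: float = 0.10) -> Optional[int]:
    """Return the index of the first existing line whose upper Y coordinate
    lies within the threshold of the word's upper Y coordinate, else None."""
    threshold = ratio * avg_height
    for index, line_top in enumerate(line_tops):
        if abs(word_top - line_top) <= threshold:
            return index  # this line would become the current line
    return None           # the word is not in any existing line
```

The loop stops at the first match, mirroring the description that the comparison is repeated until all lines are examined or the word is found to be in an existing line.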

After step 806 determines that the selected word is in the 
current line or after another existing line is set as the current 
line at step 809, step 810 determines whether the selected 
word is within a certain threshold distance or spacing. For 
example, the appropriate X coordinate of the current 
selected word is compared with the appropriate X coordinate 
of the previous word in the current line to determine whether 
the distance between the words in the X (horizontal)
direction is within the threshold distance. In particular, the top
left X coordinate of the selected word may be compared with 
the bottom right X coordinate of the left-most and/or right-most
word to determine the spacing between the words in the X
direction. If the inter-word spacing in the X direction is
greater than a threshold distance, for example, 2.5 times the
character width or 2.5 times the average character width, then
the inter-word spacing threshold is exceeded and the selected
word is determined not to be in the current line. The threshold
inter-word spacing in the X direction may be a statistic of the
inter-word spacing and may be dynamically determined. Two words
positioned approximately at the same vertical position on a page
may not be on the same line, for example, when the words are
positioned in different columns with spacing between the columns.

If step 808 determines that the selected word is not on another
existing line, a new line is started at step 812 by adding the
selected word to a new line. The new line is then defined as the
current line. Otherwise, if step 810 determines that the selected
word is not within the threshold distance, the process continues
from step 808 to determine if the selected word is on another
existing line.

If step 810 determines that the selected word is within the
threshold distance, then the selected word is added to the
current line at step 814. After either step 812 or step 814 adds
the selected word to the current line or to a new line, step 816
determines whether there are any remaining words in the sorted
list of words, i.e., words that remain unassigned to a line. If
there are any remaining words unassigned to a line, the process
continues from step 804 to select a next word. If step 816
determines that all words have been assigned to a line, the
process of joining words into lines is complete.

Illustration of Joining Words Into Lines at Step 704

FIG. 9 shows a portion of a sample document 900 illustrating
various criteria used for joining words into lines at step 704.
For example, a line is started with word 902, a first word in the
list of sorted words (e.g., sorted by position in the document).
The line is defined as the current line. A next word 904 in the
list of sorted words is selected and determined to be in the
current line, i.e., within the paragraph threshold distance in
the Y direction. Selected word 904 is also within the paragraph
threshold distance in the X direction and thus is added to the
current line.

A next word 906 in the list of sorted words is then selected and
it is determined that word 906 is in the current line, i.e., the
upper and/or lower Y coordinate(s) of word 906 are within the
threshold distance of the corresponding Y coordinate(s) of word
902, word 904, and/or the current line. It is also determined
that word 906 has X coordinate(s) which are within threshold
distance(s) from the X coordinate(s) of word 902, word 904 and/or
the current line. Thus, word 906 is added to the current line
which already includes words 902 and 904.

A next word 908 in the list of sorted words is then selected and
determined to be in the current line as the upper and/or lower Y
coordinate(s) of word 908 are within the threshold distance(s) of
the corresponding Y coordinate(s) of the current line and/or of
any words in the current line. However, because it is determined
that the distance between word 908 and any word of the current
line, i.e., words 902, 904, 906, is not within the inter-word
distance threshold along the X direction, word 908 is not added
to the current line. After determining that word 908 is not in
any other existing line, a new line is started and defined as the
current line.

In a similar manner, a next word 910 is selected, determined to
be in the current line and within the threshold distance, and
added to the current line.

A next word 912 is selected and determined not to be in the
current line nor on any other existing line such that word 912 is
added to a new line. The new line is defined as the current line.
The remainder of the words in document 900 are joined into one or
more existing and/or new lines in a manner similar to that
described above.

Step 706: Join Lines into Paragraphs

FIG. 10 shows a flow diagram illustrating the processing steps
for joining the lines into paragraphs after each of the words in
the sorted list of words has been assigned to a line. To join the
lines into paragraphs, the first line is assigned to a first
paragraph at step 1002. This first paragraph is defined as the
current paragraph. A next line is then picked or selected at step
1004.

Preferably, three criteria are met prior to assigning a selected
line to a given paragraph. The three criteria are: (1) the
selected line is near the paragraph in the Y direction as
determined at step 1006; (2) the selected line overlaps the
paragraph vertically in the X direction as determined at step
1010; and (3) the words of the selected line have the same font
size as the words in the paragraph as determined at step 1012.
These criteria and steps 1006, 1010, and 1012 are described in
more detail below.

After selecting the next line at step 1004, step 1006 determines
whether the selected line is near the current paragraph in the Y
direction. To determine whether the selected line is near the
current paragraph in the Y direction, the appropriate Y
coordinate(s) of the selected line are compared with the
appropriate Y coordinate(s) of the previous line of the current
paragraph to determine whether certain parameters and/or
thresholds are satisfied.

For example, the upper Y coordinate of the selected line may be
compared with the lower Y coordinate of the previous line in the
current paragraph to determine inter-line spacing in the Y
direction. If the inter-line spacing in the Y direction is
greater than a threshold, for example, 1.75 times the average
character height, then the inter-line spacing threshold in the Y
direction is not satisfied and the line is determined not to be
near the current paragraph in the Y direction. In addition, if
the selected line is at approximately the same position in the Y
direction as the previous line in the current paragraph, such as
within 10% of the average character height above or below the Y
coordinate of the previous line in the current paragraph, the
inter-line spacing does not satisfy the minimum inter-line
spacing threshold in the Y direction and the line is determined
not to be near the current paragraph in the Y direction. Of
course, other suitable comparisons and/or analysis may be made by
step 1006 to determine whether the selected line is near the
current paragraph.

If step 1006 determines that the selected line is not near the
current paragraph, step 1008 determines whether the selected line
is near any other existing paragraph, i.e., a paragraph which has
at least one line assigned thereto. This may be determined with
analysis similar to that described above with reference to step
1006.

If step 1006 determines that the selected line is near the
current paragraph, or if step 1008 determines that the selected
line is near another existing paragraph which is then defined as
the current paragraph, step 1010 determines whether the selected
line vertically overlaps the current paragraph. A selected line
vertically overlaps the current paragraph if the selected line
has the same alignment as the current paragraph, for example,
left, right or center alignment.

For example, if the left X coordinate of the first word of the
current line is within a threshold distance relative to the left
X coordinate of the first word of the previous line in the
current paragraph, then both the selected line and the current
paragraph are left aligned and thus overlap. However, as there
may be an indented first line in a paragraph, the threshold
distance may be defined to be a larger number when comparing the
left X coordinate of the first word of the current line with the
left X coordinate of the first word of a first line in the
current paragraph to account for the hanging indent.

If the right X coordinate of the last word of the current line is
within a threshold distance from the right-most X coordinate of
the last words of the lines of the current paragraph, then both
the selected line and the current paragraph may be right aligned
and thus overlap. Further, if the center X coordinate of the
current line, i.e., the average of the left X coordinate of the
first word and the right X coordinate of the last word of the
current line, is within a threshold distance less or greater than
the center X coordinate of the previous existing line in the
current paragraph, i.e., the average of the left X coordinate of
the first word and the right X coordinate of the last word of the
previous existing line of the current paragraph, then both the
selected line and the current paragraph may be center aligned and
thus overlap. The threshold distance may be, for example, 0.5 of
the width of a character or of the average width of a character.

The above are merely illustrative examples for determining the
alignment of the lines and whether a line near a paragraph is
similarly aligned. Other suitable methods may be utilized. For
example, the above method may be modified to only evaluate the
last existing line of the current paragraph to determine whether
the current line is similarly aligned.

If step 1010 determines that the selected line overlaps the
current paragraph, step 1012 then determines whether the words of
the selected line have the same font size as the words of the
current paragraph. As discussed above, the digital data in the
input format provides information on the font of each word,
including the style, weight (to indicate bold or not bold) and
size.

If step 1008 determines that the selected line is not near any
other existing paragraph, if step 1010 determines that the
selected line does not overlap with the current paragraph, or if
step 1012 determines that the words of the selected line do not
have the same font size as the words of the current paragraph,
then a new paragraph is started by adding the selected line to a
new paragraph and setting the new paragraph as the current
paragraph at step 1014.

If step 1012 determines that the font size of the words of the
selected line is the same as that of the words of the current
paragraph, then the selected line is added to the current
paragraph at step 1016. After either step 1014 or step 1016 adds
the selected line to a paragraph, step 1018 determines if any
lines remain to be assigned to a paragraph. If there are
remaining lines to be assigned to a paragraph, the process
continues from step 1004 to select a next line. If all lines have
been assigned to a paragraph, the process of joining lines into
paragraphs is complete.

Illustration of Joining Lines into Paragraphs at Step 706

Referring again to FIG. 9, the portion of sample document 900
also illustrates the various criteria used for joining lines into
paragraphs at step 706.

For example, after a first line 920 is added to a first paragraph
and the first paragraph is defined as the current paragraph, the
next line 922 is selected. It is then determined that line 922 is
not near the current paragraph because the Y coordinate of line
922 is at approximately the same position in the Y direction as
the previous line 920 in the current paragraph such that the
minimum inter-line spacing in the Y direction is not satisfied.
It is also determined that line 922 does not satisfy the
inter-line spacing criteria in the Y direction for any other
existing paragraphs and thus line 922 is added to a new paragraph
which is defined as the current paragraph.

As there are lines unassigned to a paragraph, next line 924 is
selected. It is determined that line 924 is not near the current
paragraph containing line 922. It is then determined that line
924 is near the paragraph containing line 920 and defines that
paragraph as the current paragraph. In addition, it is determined
that line 924 overlaps the current paragraph as line 924 and the
current paragraph are both left aligned. However, because line
924 does not contain the same font size as the current paragraph
and line 924 is not near any existing paragraphs, line 924 is
added to a new paragraph, which is then set as the current
paragraph.

In a manner similar to that described above, line 926 is
determined not to be near the current paragraph containing line
924 but is near the paragraph containing line 922 and defines
that paragraph as the current paragraph. It is determined that
line 926 overlaps the current paragraph as line 926 and the
current paragraph are both right aligned. However, because line
926 does not contain the same font size as the current paragraph,
line 926 is assigned to a new paragraph. As there are lines
unassigned to a paragraph, the next line 928 is selected as the
current line.

Line 928 is determined to be near the current paragraph
containing line 926 and overlaps the current paragraph because
line 928 and the current paragraph are both right aligned. It is
also determined that line 928 contains the same font size as the
current paragraph and line 928 is assigned to the current
paragraph containing line 926.

The remainder of the lines in document 900 are joined into one or
more new and/or existing paragraphs in a manner similar to that
described above.

Step 708: Locate Tables

After the words are joined into lines and the lines joined into
paragraphs, tables are located at step 708. Any suitable method
may be utilized to locate tables from the joined paragraphs. For
example, U.S. Pat. No. 5,737,442 issued on Apr. 7, 1998 to H.
Alam, discloses a processor based method for recognizing,
capturing and storing tabular data from digital computer data
representing a document, the disclosure of which is incorporated
herein by reference in its entirety.

One method of locating tables from a document in the original
input format at step 708 generally comprises evaluating a
horizontal projection profile of the document, determining upper
and lower boundaries of a table by analyzing white space
disclosed by the horizontal projection profiles, evaluating a
vertical projection profile of the document, and determining a
horizontal location of the table by analyzing white space
disclosed by the vertical projection profiles.

FIG. 11 shows a flow diagram illustrating process 404 for
converting the data stored in an intermediate format to the
desired output format. The intermediate format is converted to
one or more of the supported output formats at step 1102. As
noted above, the output format may be one or more versions of
HTML, XML, CSS, Netscape Layers, linked and separate pages, PDF,
TIFF (or other image formats such as GIF, BMP, JPEG), RTF, and
any other formats, although only exemplary output formats RTF
1104, HTML (tabular or with style sheets) 1106, TIFF (or other
image formats) 1108, and XML 1110 are shown. Because HTML Version
3.2, for example, does not allow placement of blocks at specified
coordinates while HTML Version 4.0, for example, allows
specification of coordinates for placement of blocks, conversion
process 404 preferably supports both HTML types. Thus, providing
a conversion process to generate HTML with style sheets as well
as tabular HTML supports differing versions of HTML. The output
may include documents in one or more of the possible output
formats.
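Returning to step 706, the three criteria for assigning a line to a paragraph (steps 1006, 1010 and 1012) may be sketched as follows. This is an illustrative sketch only: the `TextLine` type and the function names are hypothetical, the sketch assumes Y increases down the page and lines arrive in top-to-bottom order, and it compares each line only with the last line of a paragraph. The 1.75, 10% and 0.5 values are the exemplary thresholds given in the description.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    x_left: float
    x_right: float
    y_top: float
    y_bottom: float
    font_size: float


def near_in_y(line, prev, avg_h, max_gap=1.75, same_row=0.10):
    """Step 1006: close below prev, but not on the same row as prev."""
    if abs(line.y_top - prev.y_top) <= same_row * avg_h:
        return False  # side by side, e.g. a neighbouring column
    return line.y_top - prev.y_bottom <= max_gap * avg_h


def same_alignment(line, prev, avg_w, tol=0.5):
    """Step 1010: left, right or center alignment within a tolerance."""
    t = tol * avg_w
    centers = ((line.x_left + line.x_right) / 2,
               (prev.x_left + prev.x_right) / 2)
    return (abs(line.x_left - prev.x_left) <= t
            or abs(line.x_right - prev.x_right) <= t
            or abs(centers[0] - centers[1]) <= t)


def join_lines(lines, avg_h, avg_w):
    paragraphs, current = [], None
    for line in lines:
        target = None
        if current is not None and near_in_y(line, current[-1], avg_h):
            target = current
        else:  # step 1008: look at every other existing paragraph
            for p in paragraphs:
                if p is not current and near_in_y(line, p[-1], avg_h):
                    target = p
                    break
        if (target is not None
                and same_alignment(line, target[-1], avg_w)
                and line.font_size == target[-1].font_size):  # step 1012
            target.append(line)      # step 1016
            current = target
        else:
            current = [line]         # step 1014
            paragraphs.append(current)
    return paragraphs
```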

FIG. 12 shows a flow diagram illustrating step 1102 of
converting from the intermediate format document to RTF
or HTML with style sheets output format document 1104 or
1106. To convert to RTF or HTML with style sheets output
format, the top left and bottom right X and Y coordinates
may be determined for each block in the intermediate format
at step 1202. As noted above, the information stored in the 
intermediate format may include one or more blocks. Each 
block may be a paragraph, an element in a table, all or a 
portion of the table, depending upon the spacing of the 
elements of the table, or an image. 

An output format block is generated for each block of the
intermediate format at step 1204. Output format blocks are
created such that the coordinates of the output format blocks
in the output format style sheet correspond to coordinates of
the intermediate format blocks. The font of each intermediate
format block is mapped to a font in the output format
at step 1206 such that each block in the intermediate
format fits in the corresponding output format block. Each
output format block with the output format mapped font is
then placed in the corresponding output format text block at
step 1208.

Blocks in the intermediate format may be processed by 
process 1212 individually such that process 1212 is executed 
once for each intermediate format block, in multiple groups 
such that process 1212 is executed once for each group of 
intermediate format blocks, or all at once such that process
1212 is executed once for all the intermediate format blocks. 
After completion of all iterations of process 1212, an output 
RTF or HTML with style sheets format document is out- 
putted at step 1210. 
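As a rough illustration of steps 1202 through 1208 for the HTML with style sheets case, each intermediate format block might be emitted as an absolutely positioned element. The `Block` type, the `FONT_MAP` table and the function names below are hypothetical, and the sketch does not attempt the fitting check of step 1206; it only shows coordinates and a mapped font being carried into the output.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Block:
    text: str
    left: int
    top: int
    right: int
    bottom: int
    font: str
    size_pt: int


# Hypothetical mapping from input fonts to output (CSS) font families.
FONT_MAP = {
    "Times-Roman": "Times New Roman, serif",
    "Helvetica": "Arial, sans-serif",
}


def block_to_html(b):
    """Emit one positioned block whose box matches the input coordinates."""
    family = FONT_MAP.get(b.font, "serif")
    style = (f"position:absolute; left:{b.left}px; top:{b.top}px; "
             f"width:{b.right - b.left}px; height:{b.bottom - b.top}px; "
             f"font-family:{family}; font-size:{b.size_pt}pt")
    return f'<div style="{style}">{escape(b.text)}</div>'


def document_to_html(blocks):
    body = "\n".join(block_to_html(b) for b in blocks)
    return f"<html><body>\n{body}\n</body></html>"
```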

FIG. 13 shows a flow diagram illustrating step 1102 of
converting from the intermediate format to the TIFF output
format (or other image formats). First, a bitmap of the
document is generated using the intermediate format blocks
at step 1302. The bitmap of the intermediate format
document is placed into a TIFF output document at step 1304.
Finally, the TIFF output document is output at step 1308. 

FIG. 14 shows a flow diagram illustrating a first process
1400 of step 1102 of converting from the intermediate format to
tabular HTML output format 1108. As noted above, HTML
Version 3.2, for example, does not allow placement of blocks
at specified coordinates. Thus, the conversion process preferably
includes generation of a grid in a tabular HTML output
document. The grid may generally be a table having,
preferably, a minimal number of cells.

To convert to the tabular HTML output format, a list of
upper and lower Y coordinates, y1, y2, of each block is
created at step 1402. The list of Y coordinates is scanned to
locate gaps or spaces between blocks in the Y direction and
the upper and lower Y coordinates, y1', y2', of each gap
between blocks are recorded at step 1404. As is evident, the
Y coordinates, y1', y2', of each gap generally correspond to
the y1 Y-coordinate of one block and the y2 Y-coordinate of
another block. Similarly, a list of left and right X
coordinates, x1, x2, of each block is created at step 1406.
The list of X coordinates is scanned to locate gaps or spaces
between blocks in the X direction and the left and right
X-coordinates, x1', x2', of each gap between blocks are
recorded at step 1408. As is evident, the X coordinates, x1',
x2', of each gap generally correspond to the x1 X-coordinate
of one block and the x2 X-coordinate of another block.

Next, "m" is assigned to be the number of y1', i.e., the
number of gaps in the Y direction, and "n" is assigned to be
the number of x1', i.e., the number of gaps in the X direction,
at step 1410. A macro table with m+1 rows and
n+1 columns is then created at step 1412.
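The gap search of steps 1402 through 1412 may be sketched as follows. The tuple layout `(x1, y1, x2, y2)` for a block and the function names are assumptions made for the example; blocks whose extents touch or overlap along an axis are treated as leaving no gap between them.

```python
def find_gaps(intervals):
    """Return the (start, end) of each empty gap between the intervals."""
    gaps = []
    covered_end = None
    for start, end in sorted(intervals):
        if covered_end is not None and start > covered_end:
            gaps.append((covered_end, start))
        covered_end = end if covered_end is None else max(covered_end, end)
    return gaps


def macro_table_shape(blocks):
    """blocks: list of (x1, y1, x2, y2). Returns (rows, cols, y_gaps, x_gaps)."""
    y_gaps = find_gaps([(y1, y2) for _, y1, _, y2 in blocks])  # steps 1402-1404
    x_gaps = find_gaps([(x1, x2) for x1, _, x2, _ in blocks])  # steps 1406-1408
    m, n = len(y_gaps), len(x_gaps)                            # step 1410
    return m + 1, n + 1, y_gaps, x_gaps                        # step 1412
```

For a page with one full-width block above two side-by-side blocks, the sketch finds one gap in the Y direction and none in the X direction, since the full-width block spans the space between the lower two.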

The border between row j and row j+1, where j ranges
from 1 to m, is positioned at the y1' Y coordinate of the j-th
gap. Thus, the height of each row is the distance between two
borders along the Y direction. For a row which extends to an
edge of the page in the Y direction, the height of such a row is
the distance from the edge of the document, i.e., the minimum
or maximum Y coordinate of the table being divided, to the
row border. If there is only one row, the height is simply
equal to the maximum Y coordinate of the table being
divided. In addition, the border between column i and
column i+1, where i ranges from 1 to n, is positioned at the x1'
X coordinate of the i-th gap. Thus, the width of each column is
the distance between two borders along the X direction. For a
column which extends to an edge of the page in the X direction,
the width of such a column is the distance from the edge of the
document, i.e., the minimum or maximum X coordinate of
the table being divided, to the column border. If there is only
one column, the width is simply equal to the maximum X
coordinate of the table being divided.

At step 1414, each cell of the macro table is recursively 
subdivided using above-described process 1400. In the first 
iteration of process 1400, "macro table" refers to the table
encompassing the entire page or document. In each subse- 
quent iteration of process 1400, "macro table" refers to a 
table encompassing only a cell of a higher-level macro table 
being sub-divided. In either case, the maximum and mini- 
mum X and Y coordinates for all subsequent iterations of 
process 1400 are those of the cell of the higher-level macro 
table being sub-divided. Process 1400 is repeated until each 
cell of the initial and all subsequent macro tables can no 
longer be divided. Each cell of the macro table may include 
one or more intermediate format blocks. 
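The recursive subdivision of step 1414 may be sketched as follows. This is only an illustration under stated assumptions: blocks are `(x1, y1, x2, y2)` tuples, the helper names are hypothetical, and a non-divisible cell is returned simply as the list of blocks it contains.

```python
def bands(blocks, axis):
    """Merge block extents along an axis (0 = X, 1 = Y) into gap-separated bands."""
    out = []
    for lo, hi in sorted((b[axis], b[axis + 2]) for b in blocks):
        if out and lo <= out[-1][1]:
            out[-1][1] = max(out[-1][1], hi)
        else:
            out.append([lo, hi])
    return out


def subdivide(blocks):
    """Divide blocks into a macro table, then recurse into each cell."""
    rows, cols = bands(blocks, 1), bands(blocks, 0)
    if len(rows) == 1 and len(cols) == 1:
        return blocks  # no gaps in either direction: non-divisible cell
    table = []
    for y_lo, y_hi in rows:
        row = []
        for x_lo, x_hi in cols:
            cell = [b for b in blocks
                    if y_lo <= b[1] and b[3] <= y_hi
                    and x_lo <= b[0] and b[2] <= x_hi]
            row.append(subdivide(cell) if cell else [])
        table.append(row)
    return table
```

With m gaps in the Y direction there are m+1 bands, so the number of rows and columns produced at each level matches the macro table dimensions described above.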

FIG. 15A shows a page of a sample document and FIG. 
15B illustrates approximate division of the sample document 
page of FIG. 15A into cells of a macro table. As shown in 
FIG. 15B by dashed lines, the macro table is divided into 
cells in five rows and a single column in the first iteration.
Further, each block is designated with a border around the 
block. The horizontal span of the cell of the first or top row 
prevents this first macro table from being further divided. 
After all iterations of subdividing the highest-level macro 
table, each block occupies a single cell of the HTML table. 
FIG. 15C shows an example of a subsequent iteration of 
dividing a macro table. Specifically, the cell of the last row 
of the first macro table is itself a lower-level macro table 
which can be divided into two columns. Although not 
shown, further subdivisions of other cells of the first or
highest-level and subsequent or lower-level macro tables are
possible.

FIG. 16 shows a flow diagram illustrating a second 
process 1600 of step 1102 to convert an intermediate format 
document to a tabular HTML output document. Process 
1600 attempts to partition each non-divisible cell generated
by the first process 1400 and places each intermediate 
format block at the corresponding coordinate in the output 
tabular HTML document. 

Specifically, a first cell of all the macro tables is selected 
at step 1602. The first cell may be the cell having the 
smallest upper left X coordinate and/or the smallest upper 
left Y coordinate. Each cell may include one or more 
intermediate format blocks. Starting at the top left corner of 
the selected cell, a vector of the X coordinate of the left edge 
and a vector of the Y coordinate of the top edge of each 
block in the cell is generated at step 1604. Each Y direction
vector has an X coordinate corresponding to the left edge of
the corresponding block and each X direction vector has a Y
coordinate corresponding to the top edge of the corresponding
block. The highest common factor for each of the X and
Y coordinates of the Y-direction and X-direction vectors,
respectively, is determined at step 1606. 

A table of X and Y coordinates is generated at step 1608 
where the X and Y coordinates are multiples of the highest 
common factor for the X and Y coordinates, respectively. 
The intermediate format blocks within each cell are then
positioned at the corresponding coordinates of the HTML
table at step 1610. Step 1612 determines if the selected cell
is the last cell of the intermediate format document or if
there is any cell that has not been selected. If the selected cell
is not the last cell of the intermediate format document or if
there are unselected cells, then step 1614 selects the next cell
and continues from step 1604. If the selected cell is the last
cell or the last selected cell of the intermediate format
document, then the HTML table containing the blocks
therein is outputted as an output tabular HTML document at
step 1616. 

As an example to illustrate the determination of the highest
common factor at step 1606 and the generation of a table
within the cell at step 1608, if the X coordinates of the left
edges of the blocks in the cell are 3, 12, 30 and 45, the
highest common factor would be 3. Thus, the table of X
coordinates generated by step 1608 would be 3, 6, 9, 12, 15,
18, 21, 24, 27, 30, 33, 36, 39, 42 and 45, i.e., multiples of
3, the highest common factor.
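The example above can be reproduced with a short sketch. The function name `grid_coordinates` is hypothetical, and the sketch assumes the coordinates are positive integers.

```python
from functools import reduce
from math import gcd


def grid_coordinates(coords):
    """Return the highest common factor and its multiples up to max(coords)."""
    step = reduce(gcd, coords)                           # step 1606
    return step, list(range(step, max(coords) + 1, step))  # step 1608


step, grid = grid_coordinates([3, 12, 30, 45])
```

Here `step` is 3 and `grid` runs from 3 to 45 in increments of 3, so every block edge (3, 12, 30 and 45) falls on a grid coordinate.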

FIG. 17 shows a portion of a sample document illustrating
the partitioning of a non-divisible cell of a table into a table
of X and Y coordinates, although only the positions of the
partitioning X coordinates are shown for purposes of clarity.
In the sample document portion shown, each line of text
containing more than one block may become a macro table
which is further divided such that each block is an element 
of the macro table. The line segments shown indicate 
multiples of the highest common factor of the X coordinates 
of the blocks of each macro table. 

Reformatting for Display on Differently Configured Displays

The above-described conversion process may be utilized
to convert data representing a document to a format suitable
for display on a display having a configuration different from
that for which the input format is suitable. For example, a
document may be in a format suitable for display on a
typical desktop or laptop monitor and it may be desirable to
convert the document to another format suitable for display
on, for example, internet connected televisions and/or portable
devices such as cellular or wireless telephones, PDAs,
pagers, and/or voice products. The different configuration 
requirements may be attributable to different display sizes 
and/or resolutions, for example. 

FIGS. 18-28 illustrate the process for and examples of 
such reformatting for different display configurations.
Reformatting process 1800 may include determining sub-page
breaks in a document and subdividing the document
into sub-pages at step 1802. A sub-page break may be a
divider line either horizontally or vertically across a page,
for example. The first sub-page is then selected as the current
sub-page at step 1804 and the first block in the current
sub-page is selected as the current block at step 1806. If it
is determined that the current block is within the display
parameter of the display configuration at step 1808, then the
current block is displayed at step 1810. If the current block
is determined not to be within the display parameter of the
display configuration at step 1808, then the current block is
divided into portions such that each portion is within the
display parameter of the display configuration and the 
portions are displayed at step 1812. 

After step 1810 or step 1812, if step 1814 determines that 
there are remaining blocks in the sub-page, then the next 
block in the sub-page is selected as the current block at step 
1816 and the process continues from step 1808. However, if 
step 1814 determines that there are no remaining blocks to 
be displayed in the current sub-page, then step 1818 determines
if there are any remaining sub-pages in the document.
If there are remaining sub-pages in the document, the next 
sub-page is selected as the current sub-page at step 1820 and 
the process continues from step 1806. If there are no 
remaining sub-pages in the document, reformatting process 
1800 is complete. 
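The control flow of reformatting process 1800 may be sketched as two nested loops. The function names and the use of plain strings as blocks are assumptions made for the example; `fits`, `divide` and `display` stand in for the display-parameter test of step 1808, the division of step 1812 and the actual output device.

```python
def reformat(sub_pages, fits, divide, display):
    """Display every block of every sub-page, dividing blocks that do not fit."""
    for sub_page in sub_pages:                 # steps 1804, 1818, 1820
        for block in sub_page:                 # steps 1806, 1814, 1816
            if fits(block):                    # step 1808
                display(block)                 # step 1810
            else:
                for portion in divide(block):  # step 1812
                    display(portion)


def divide_by_words(block, limit):
    """Greedily pack whole words into portions of at most `limit` characters."""
    portions, current = [], ""
    for word in block.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit or not current:
            current = candidate
        else:
            portions.append(current)
            current = word
    if current:
        portions.append(current)
    return portions


shown = []
reformat(
    [["short", "a much longer block of text"]],
    fits=lambda b: len(b) <= 12,
    divide=lambda b: divide_by_words(b, 12),
    display=shown.append,
)
```

In this usage, the display parameter is modelled as a limit of 12 characters per displayed portion, which is a simplification of the display sizes and resolutions discussed above.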

In one embodiment, after displaying a block such as at 
step 1810 or after displaying the last portion of a block such 
as at step 1812, process 1800 may determine if the block is 
a paragraph that ends with an incomplete sentence or an 
improper termination. 

The determination of whether a block is a paragraph may 
be achieved by determining whether the block contains one 
or more sentences. A sentence may be defined as having an 
initial capitalization followed by a sentence termination 
punctuation such as a period, exclamation mark, or a ques- 
tion mark which represents the termination of the sentence. 
It may be determined that the block is not a paragraph, such 
as in cases where the block is a bullet point or an item in a 
listing of multiple items. If the block is determined to be a 
paragraph terminating with an incomplete sentence or an 
improper termination, then it is determined if the next block 
begins with an improper sentence or paragraph beginning. 

If the block is not a paragraph that ends with an incomplete
sentence or an improper termination, process 1800
may continue to step 1814 as described above. If the
block is a paragraph that ends with an incomplete sentence
or an improper termination, then the process may determine
if the next block begins with an improper sentence or
paragraph beginning. An improper sentence or paragraph
beginning may contain an initial incomplete sentence without
an initial capitalization but containing other initial
capitalization(s) and sentence terminations. Alternatively or
additionally, an improper sentence or paragraph beginning
may contain a non-indented first line while the first line of
the previous paragraph(s) is indented, for example.
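The two text-based tests may be approximated with simple heuristics, sketched below. The function names are hypothetical and the rules are deliberately simplified assumptions: a block ends improperly if its last character is not sentence-terminating punctuation, and begins improperly if it starts with a lowercase letter yet contains a sentence termination later on. The indentation-based test is not covered.

```python
import re

TERMINATORS = ".!?"


def ends_incomplete(text):
    """True if the block does not end with sentence-terminating punctuation."""
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1] not in TERMINATORS


def begins_improperly(text):
    """True if the block starts in lowercase but contains a sentence ending."""
    stripped = text.lstrip()
    if not stripped or not stripped[0].islower():
        return False
    return re.search(r"[.!?]", stripped) is not None
```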

If the next block is not a paragraph or is not a paragraph 
that ends with an incomplete sentence or an improper 
termination, then the process examines a predetermined 
number of subsequent blocks or original document pages or 
blocks in a predetermined area of the document, for 
example, to locate the first subsequent block containing a 
paragraph. If no paragraph is located or if the located 
paragraph does not begin with an improper paragraph 
beginning, then the process may continue to execute step 
1814 as described above. If a paragraph with an improper 
paragraph beginning is located, then that paragraph block 
may be displayed immediately prior to displaying any 
intervening blocks. The process then continues from step 
1814 as described above with only the remaining undis- 
played blocks. 

In another embodiment, matching of two incomplete 
paragraphs may be achieved by examining blocks located to 
the right of the initial incomplete paragraph, rather than 
simply searching for the second complementary incomplete 
paragraph from sequentially subsequent blocks. In this 
embodiment, multiple matches may be found and preferably 
paragraphs that are close in Euclidean distances are 
matched. 
A syntactic analysis may be executed alternatively or
additionally to the above-described incomplete paragraph
location process. Parsing rules may be used to determine if
the combination of the last and the first incomplete sentences
of two paragraph blocks parse correctly according to English
grammar rules.

FIG. 19 shows a flow diagram of step 1812 for dividing
the current block into portions for display such that each
portion is within the display parameter or configuration of
the display configuration of the output application or device.
First, step 1902 determines if the current block is a table. If
the current block is not a table step 1904 breaks up the
current block into elements such that each element can be
displayed within the display configuration. Each element of
a paragraph block may be, for example, a word contained in
the paragraph. Other division of a block into elements may
be implemented. For example, each element of a list block
may be an item or a line in the list.

Step 1904 also sequentially displays each element until
the display configuration limits are reached or all the
elements of the current block are displayed. Step 1904
continues to sequentially display the remaining elements of the
current block using a new display each time the display
configuration limits are reached. Each element of the current
block may comprise a word or a line, for example, which can
be broken up into multiple lines and/or multiple words.

If the current block is a table, the first row and first column
of the table are selected as the row and column headings at
step 1905. Although not all first rows and first columns of
tables are headings, it can be assumed that the first row and
first column are headings. A method may be implemented by
which to discriminate between a heading row or column and
a data row or column. In addition, some input formats may
identify headings of tables and that data can be utilized in
this process.

Step 1906 determines the number of columns n that can
be displayed with the column heading, if any, within the
display configuration. The n non-heading columns are then
selected and the selected elements or columns of the first
row are added to a sub-block set as the current sub-block at
step 1907. The n elements of the next row are selected as the
current row and added to the current sub-block at step 1908.

Step 1910 then determines if the current sub-block can be
displayed within the display configuration. If the current
sub-block can be displayed within the display configuration,
then step 1911 displays the current sub-block. If the current
sub-block cannot be displayed within the display
configuration, then step 1912 removes the current row from
the current sub-block, displays the current sub-block, and
adds the current row to a new sub-block having the heading
as its first row. The new sub-block is also set as the current
sub-block.

After step 1911 or step 1912, step 1914 determines
whether the current row is the last row of the table. If the
current row is not the last row of the table, n elements of the
next row is selected as the current row and added to the
current sub-block at step 1916 and the process is continued
from step 1910. If the current row is the last row of the table,
then step 1918 determines if the last column displayed is the
last column of the table. If the last column displayed is not
the last column of the table, then the process continues from
step 1906. If the last column displayed is the last column of
the table, then the process is complete.

In certain circumstances, it may be necessary or desirable
to recombine certain cells of a table because the table may
have been excessively divided. For example, if a row spans
two or more lines, the single row may have been subdivided
into multiple rows. The recombining of cells may be
especially desirable where process 1812 assigns portions of the
table as heading such that the correct heading is displayed in
each display page that displays portions of the table.

In one embodiment, improper or erroneous cell breaks
between the rows may be determined by locating the upper
and lower Y coordinates of each of the rows and determining
which of the cell breaks may be improper based on
the inter-row gaps. For example, the interline spacing within
a row may be less than the spacing between two rows. A
similar approach may be used to determine improper or
erroneous cell breaks between columns.

Additionally or alternatively, based on the nominal cell
breaks, improper or erroneous cell breaks between the
columns and/or rows may be determined by locating blank
cells and recombining the cells in order to eliminate such
blank cells in an optimal manner. For example, in a row
where only one cell spans across two lines and the each
remaining cell only spans one line, the row may be
improperly divided into two rows, resulting in all but one blank cell
in the lower or second row. The optimal elimination of the
blank cells in the lower or second row may be to recombine
the mostly blank row with the previous row. Again, a similar
approach may be used to determine and remove improper or
erroneous cell breaks between the columns.

Certain rules may be set and applied to determine and
remove excessive division of table cells. For example, a
heading row or column may be all capitalized, larger font,
bold, italics, and/or center aligned while the remainder of the
cells do not have some or all of these characteristics. Thus,
if the first two rows or columns are all capitalized, larger
font, bold, italics, and/or center aligned while the remainder
of the cells do not have some or all of these characteristics,
it may be determined that the first two rows and/or columns
should be recombined into one row or column. As is evident,
numerous other methodologies may be utilized to determine
the optimal table cell division.

In another embodiment, cell breaks may be additionally
or alternatively analyzed using semantic analysis to
determine correct heading. However, the semantic analysis may
require a large amount of context knowledge because often
an incomplete sentence with only noun or verb phrases are
used as headings.

The above-described cell recombining process may be
performed at various points of process 1812. For example,
the recombining process may be performed when selecting
the first row and column as the headings at step 1905, when
determining the number of columns that can be displayed at
step 1906, when selecting non-heading columns at step
1907, and when selecting element of a next row at step 1908
or step 1916.

Further, a table may contain one or more sub-tables. In a
sub-table, a portion of a column and/or a row may be divided
into sub-columns and/or sub-rows. Such sub-tables may lead
to multiple row and/or column headings being displayed in
display pages. The above-described table detection
algorithm may be utilized to recursively search through table
cells to determine these sub-tables.

FIG. 20 shows a sample document 2000. Sample
document 2000 may be divided into four sub-pages by three
sub-page breaks 2002, 2004, 2006. Sub-page breaks may be
determined by a block containing non-text or image
extending across a threshold portion of the width of the page or
document. For example, a sub-page break may be a line, as
shown in FIG. 20, an image or picture, or series of dashes
or other repeating character, extending across, for example,
at least 70% of the width of the page or width of the page
within margins, if any.

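The sub-page break rule (a non-text block extending across a threshold portion of the page width) might be sketched as below. The `Block` fields, the `kind` labels, and the function names are assumptions made for illustration only; the 70% default follows the example threshold given in the text.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical block record; `kind` is e.g. "text", "line", or "image".
    name: str
    kind: str
    left: float
    right: float
    top: float


def is_break(block, page_width, threshold=0.70):
    """A non-text block spanning at least `threshold` of the page width."""
    return block.kind != "text" and (block.right - block.left) >= threshold * page_width


def split_into_subpages(blocks, page_width, threshold=0.70):
    """Group the non-break blocks into sub-pages delimited by the breaks."""
    break_ys = sorted(b.top for b in blocks if is_break(b, page_width, threshold))
    subpages = [[] for _ in range(len(break_ys) + 1)]
    for b in blocks:
        if not is_break(b, page_width, threshold):
            subpages[bisect.bisect_left(break_ys, b.top)].append(b)
    for page in subpages:
        # Sequence blocks from top to bottom, then from left to right.
        page.sort(key=lambda b: (b.top, b.left))
    return subpages
```

A document with three such breaks would yield four sub-pages, as in sample document 2000.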
Sample document 2000 contains tables 2008, 2010, 2012.
The sequence for displaying the elements of the sub-page
between sub-page breaks 2004, 2006 is also shown in FIG.
20 by arrow 2014 wherein the blocks of the sub-page are
sequenced from top to bottom, from left to right.

The sub-page between sub-page breaks 2004, 2006 of 
document 2000 includes headings 2016. Headings 2016 are 
preferably identified either in the process of converting an 
input format document to an intermediate format document, 
or during reformatting process 1800. The headings may be
used to automatically generate a list or table of contents. 
Generation of a table of contents may be an option selected 
by a user or set as default. Preferably, the table of contents 
may be inserted as a first display page in reformatting 
process 1800. Each heading displayed in the first display 
page preferably includes a link to the display containing the
heading and its associated content. 
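A linked table of contents of this kind could be generated, for example, as in the following sketch. The `page{n}.html` naming of display pages and the `(title, page)` input shape are hypothetical choices made for the example, not part of the described process.

```python
from html import escape


def build_toc(headings):
    """Build an HTML list linking each heading to its display page.

    `headings` is a list of (title, page_number) pairs; the page file
    naming scheme used for the links is an assumption of this sketch.
    """
    items = [
        f'<li><a href="page{page}.html">{escape(title)}</a></li>'
        for title, page in headings
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

The resulting list may then be placed on the first display page.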

Alternatively, particularly if reformatting process 1800 is
performed on-the-fly, the link of the heading displayed in the
table of contents display page is to the heading within the
output format document and not to a specific display page.
When a user selects the link of the heading displayed in the
table of contents display page, the reformatting process
1800 ignores all contents occurring prior to the selected
heading such that the user is presented with a display page
having the selected heading as the first content displayed. In
other words, breaks between display pages may differ
depending upon the link or heading selected by the user.

In this embodiment, reformatting process 1800 preferably 
can generate display pages in reverse order. For example, 
after a user selects a heading in the table of contents and
views a display page displaying the selected heading as the 
first content, the user may select a previous page. Then 
reformatting process 1800 preferably determines, in reverse 
sequence, blocks and/or portions of blocks that can be 
displayed within the display parameters of the display
configuration. 
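The effect of paginating forward from a selected heading, and backward for a previous page, may be illustrated with the simplified sketch below. It assumes each block has a known display height and that a page has a fixed capacity; both are hypothetical simplifications, and blocks taller than the capacity (which would have to be divided) are not handled.

```python
def page_forward(heights, start, capacity):
    """Exclusive end index of the display page that begins at block `start`."""
    used, end = 0, start
    while end < len(heights) and used + heights[end] <= capacity:
        used += heights[end]
        end += 1
    return end


def page_backward(heights, end, capacity):
    """Start index of the display page that ends just before block `end`."""
    used, start = 0, end
    while start > 0 and used + heights[start - 1] <= capacity:
        used += heights[start - 1]
        start -= 1
    return start
```

With block heights `[3, 4, 5, 2, 6]` and a capacity of 10, paging from the top breaks after the second block, whereas selecting a heading at block 3 gives a page of blocks 3 and 4 and a previous page of blocks 1 and 2, so the page breaks differ with the selected heading.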

FIGS. 21A-F show the five display pages into which 
sample document 2000 may be divided in order to fit as 
many elements or sub-blocks of the sub-pages onto each 
display page. Note that each of tables 2008, 2010, 2012,
2014 is displayed on a single display page and is not 
displayed across multiple display pages as these tables are 
within the display configuration requirements of the output 
display device. 

FIG. 22 shows a sample table 2200 which may be
contained within a document. FIGS. 23A and 23B show
sample display pages by which table 2200 may be displayed.
As shown, at least a portion of the first row forming the row
heading of sample table 2200 is displayed in each of the
display pages. Further, at least a portion of the first column
forming the column heading of sample table 2200 is
displayed in each of the display pages. In the display page
shown in FIG. 23A, the first two columns of all rows of the
table in addition to the column heading are displayed. In the
display page shown in FIG. 23B, all rows of the remaining
three columns subsequent to the last column displayed in
FIG. 23A are displayed, in addition to the column heading.
Although not shown in this example, the rows of sample
table 2200 may also be divided to be displayed across
multiple display pages.
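The division of a table into display pages that each repeat the headings can be sketched as follows. This is a simplified illustration, not the flow of FIG. 19 itself: it assumes the first row and first column are headings and that the display limits can be expressed as a maximum number of columns and rows per page (`max_cols` and `max_rows`, both hypothetical parameters that include the headings).

```python
def split_table(table, max_cols, max_rows):
    """Split `table` (a list of rows) into display pages.

    Every page repeats the heading row and the heading column.
    """
    heading, body = table[0], table[1:]
    n = max_cols - 1              # non-heading columns per page
    rows_per_page = max_rows - 1  # non-heading rows per page
    if n < 1 or rows_per_page < 1:
        raise ValueError("display too small for headings plus data")
    pages = []
    for c in range(1, len(heading), n):                # next subset of columns
        cols = [0] + list(range(c, min(c + n, len(heading))))
        for r in range(0, max(len(body), 1), rows_per_page):
            rows = [heading] + body[r:r + rows_per_page]
            pages.append([[row[i] for i in cols] for row in rows])
    return pages
```

A practical implementation would measure rendered column widths rather than count columns, so that the subsets may differ in size as in FIGS. 23A and 23B.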

FIG. 24 shows a schematic of a system 2300 over which a
service for converting data representing a document into an
output format document may be provided over a network
2304. FIG. 25 shows a flow diagram of the service for
converting data representing a document over the network.

The service for converting data representing a document 
may be provided by a computer system 2302 over a network 
2304, such as the Internet or an intranet. Network 2304 may
be connected to a server 2306 which provides documents,
such as webpages, in an input format. Network 2304 may
also be connected to output devices such as PDAs 2308,
laptop computers 2310, and desktop PCs 2312. Although not
shown, many other devices such as cellular telephones and
pagers may also be connected to network 2304.

When computer system 2302 receives a request from an 
output device such as PDA 2308 to display a document 
supplied by server 2306, computer system 2302 may 
execute process 2500 for converting an input format docu- 
ment to an output format document. Specifically, process 
2500 includes receiving an input document over the network 
at step 2502. A virus detection program is preferably
executed to detect the presence of viruses in the input
document at step 2504. If a virus is detected, step 2506 sends
a message over the network to the user or the requesting 
device that the input document contains viruses. 
Alternatively, if the document containing a virus can be 
repaired, the document may be repaired and the process 
continues to step 2508. 

If no virus is detected or if a virus is detected and 
removed, step 2508 determines if the input document is in 
a supported format. If the input document is not in a 
supported format, process 2500 ends. If the input document 
is in a supported format, the input document is converted to 
an intermediate format document at step 2510. The inter- 
mediate format document is in turn converted to an output 
format document at step 2512. This conversion process may 
be as described above, including reformatting as necessary 
or as requested such that a single page of the input document 
may be separated into multiple display pages. 
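The control flow of steps 2504 through 2512 may be summarized by the sketch below. The function name, the `SUPPORTED` set, and the injected callables (`scan`, `to_intermediate`, `to_output`) are placeholders introduced for illustration; the actual scanning and conversion routines are as described elsewhere in this description.

```python
# Hypothetical set of supported input formats, for illustration only.
SUPPORTED = {"pdf", "rtf", "html"}


def convert_request(doc, fmt, scan, to_intermediate, to_output):
    """Return a (status, result) pair for one conversion request."""
    if scan(doc):                               # step 2504: virus check
        return ("virus", "input document contains a virus")  # step 2506
    if fmt not in SUPPORTED:                    # step 2508: format check
        return ("unsupported", None)
    blocks = to_intermediate(doc)               # step 2510
    return ("ok", to_output(blocks))            # step 2512
```

The repair alternative of step 2506 is omitted here for brevity.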

A table of contents may be generated using headings as 
described above and inserted in the output format document 
at step 2514. In addition, particularly if more than one output 
format is generated at step 2512, an executable program, 
such as a JAVA™ script, may be inserted into the output 
format document at step 2514. Although described in terms 
of a JAVA™ script, other programming languages such as 
Common Gateway Interface (CGI), Visual Basic, Practical 
extraction and reporting language (Perl), C, and C++ may, of 
course, be utilized. Preferably the JAVA™ script is inserted
at the beginning of the output format document. The
JAVA™ script may be executed by the display device such 
as the PDA to select a suitable output format from the 
plurality of output formats generated for display. The suit- 
able output format may depend upon, for example, the 
display device and/or the browser used by the display 
device. The output format document is then sent or delivered 
over the network to the user or the requesting device at step 
2516. Where more than one output format is generated, an 
output document may be generated for each output format or 
a single output document may be generated for all output 
formats. In either case, the JAVA™ script is preferably 
inserted into each output document. 
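The selection performed by the inserted program can be illustrated, in Python rather than as a script embedded in the output document, by the following sketch. The idea of an ordered preference list per display device is an assumption of the example; the text states only that the choice may depend on the display device and/or browser.

```python
def select_output_format(available, device_preferences):
    """Return the first preferred format that was actually generated, else None."""
    for fmt in device_preferences:
        if fmt in available:
            return fmt
    return None
```

For instance, a device preferring PDF, then HTML, then GIF would be given the HTML version when only HTML, RTF, and GIF versions were generated.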

The user may provide the input document or the location 
or address of the input document, such as an Internet web 
address, for example. The specific output format may also be 
specified by the user or may be determined depending upon 
the requesting application or output display device. The 
request and other information from the user may be deliv- 
ered to computer system 2302 via electronic mail, Internet 
or intranet, for example, over a network 2304. 

Where the input document is converted to multiple output 
format documents, the output documents may be stored in 
memory of computer system 2302 at least until the appro- 
priate output format document is displayed by the output 
display device. Alternatively, all the output format
documents may be sent to the output display device and the
suitable output format may be determined by executing the
JAVA™ script as described above. In another alternative,
process 2500 may generate only one output document in an
output format requested by the user or determined to be the
appropriate format displayable by the output display device.
Thus, process 2500 may dynamically convert the input
format document to the appropriate output format document
depending upon the appropriate output display format.

Preferably, process 2500 may also include determining if
a browser of the output display device supports certain
executables contained in the original input document. For
example, as noted above, the intermediate and output format
documents preferably retain any embedded animation,
sounds and/or music, as well as the execution of links to start
up other applications. Thus, process 2500 may determine if
any or all of such executables contained in the original input
document are supported or executable by the output display
device. If certain of such executables are not supported or
executable by the output display device, process 2500 may
remove such embedded executables to avoid error messages.
Alternatively, conversion step 2510 may automatically
remove or retain such embedded executables depending
upon the format of the output document.

In another embodiment, certain optimization steps may be
performed in order to optimize the output for specific
browsers or specific characteristics. For example, process
2500 may optimize the output document where the output
display device utilizes INTERNET EXPLORER™ or
NETSCAPE™, or process 2500 may optimize the output
document for space, accuracy, and/or output as single or
multiple files. These parameters may be set to certain
defaults and/or specified by the user. The user may also
specify a text only or image only output. Alternatively,
conversion step 2510 may perform such optimization steps.

FIG. 26 shows a flow diagram illustrating a process 2600 
for generating a knowledge base or document repository 
using one or more storage formats. FIG. 27 shows a
schematic of a system 2700 in which the knowledge base or
document repository using a uniform storage format may be 
used. 

As shown, the above-described conversion process may
be utilized to generate a knowledge base or document
repository of documents in various input formats using, for
example, a single uniform storage format. The documents
stored in the storage format can also be subsequently
converted to other output formats for display on a display
device. Preferably, where a single uniform storage format is
utilized, the output format is HTML Version 4.0. However,
other storage formats may be utilized.

Process 2600 first creates an index document which
contains JAVA™ script preferably at the beginning of the
document. The JAVA™ script, as described above, may be
executed by the display device such as the PDA to select a
suitable output format from the plurality of output formats
generated for display. Other programming languages may be
utilized although JAVA™ is preferred. The index document
may be utilized by a search engine, for example, to search
for documents containing certain key words. Each keyword
contained in the index document may include links to the
keyword contained in one or more input documents.
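A keyword index of this kind, in which each keyword carries links to the documents containing it, might be built as in the sketch below. The `name#keyword` link form and the simple word tokenization are assumptions made for the example.

```python
import re
from collections import defaultdict


def build_index(docs, keywords):
    """Map each keyword to links into the documents that contain it.

    `docs` maps a document name to its text; the link syntax is hypothetical.
    """
    index = defaultdict(list)
    for name, text in docs.items():
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        for kw in keywords:
            if kw.lower() in words:
                index[kw].append(f"{name}#{kw.lower()}")
    return dict(index)
```

A search engine could then consult the index document instead of scanning every stored document.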

Process 2600 then locates and inputs an input document 
or file at step 2604 and determines if the input document is 
in a supported input format at step 2606. If the input
document is in a supported input format, step 2608 converts 
the input document to one or more different output format 
documents. Conversion step 2608 is preferably as described
above, utilizing an intermediate format. Preferably, an index 
of all or certain key words of the input document is gener- 
ated and inserted into the index document at step 2610. In 
addition, a table of contents is preferably generated at step 
2612 for each output format document in the corresponding 
output format and inserted into the corresponding output 
format document. A JAVA™ script may be inserted into the
output format document at step 2614, preferably at the
beginning of the output document. The JAVA™ script, as
described above, may be executed by the display device 
such as the PDA to select a suitable output format from the 
plurality of output formats generated for display. Other 
programming languages may be utilized although JAVA™ is 
preferred. 

After step 2614 or if step 2606 determines that the input 
document is not in a supported input format, step 2616 
determines if there are any other input files. If there are other 
input files, process 2600 continues from step 2604. If there 
are no other input files, process 2600 is complete. 

A repository generated by process 2600 preferably stores 
the input documents in the input format as well as the one 
or more storage formats. As additional input documents are 
received by the repository, process 2600 converts each 
additional input document to one or more storage formats. 
Where more than one storage format is utilized, a single 
converted document may be generated containing the input 
document in multiple storage formats. Alternatively, mul- 
tiple storage documents may be generated, each in a differ- 
ent storage format. 

The knowledge base or document repository generated by 
process 2600 may be used in conjunction with an input-output
format converter including the display reformatting function
described above. For example, a request may be made from 
a PDA to view a document from the repository. The input 
and repository storage formats may be different from a 
format suitable for display on the PDA. The input-output 
format converter may be utilized to convert the storage 
format repository document to an output format document 
suitable for display on the PDA. 

The system 2700 shown in the schematic of FIG. 27 
utilizes the knowledge base or document repository gener- 
ated using process 2600 described above. System 2700 
includes a document converter 2702 coupled to a network 
2704 and a computer system 2706 storing the knowledge 
base or document repository. Document converter 2702 may 
be similar to that described above wherein a document may 
be converted to an intermediate format document and then 
to a document in a different format. Network 2704 may be 
the Internet or an intranet, for example. Various display 
devices 2708 may be coupled to network 2704. Examples of 
display devices include PDAs, laptop computers, desktop 
PCs, internet connected televisions, cellular or wireless 
telephones, pagers, and/or voice-only products. Other con- 
figurations of system 2700 may be implemented to utilize 
the knowledge base or document repository generated by 
process 2600. 

While the above is a complete description of preferred 
embodiments of the invention, various alternatives, 
modifications, and equivalents can be used. It should be 
evident that the invention is equally applicable by making 
appropriate modifications to the embodiments described 
above. Therefore, the above description should not be taken 
as limiting the scope of the invention that is defined by the 
metes and bounds of the appended claims along with their 
full scope of equivalents. 
What is claimed is:

1. A computer implemented method of converting a first 
document in a first document file format to a second docu- 
ment in a second document file format different from the first 
document file format, comprising:

locating first document file format data in the first docu- 
ment; 

grouping said first document file format data into at least 
one intermediate document file format block in an 
intermediate document file format document, including
locating words in the first document, joining words into 
lines, and joining lines into paragraphs, each paragraph 
being one of said intermediate format blocks; 

locating tables, each table being one of said intermediate 
format blocks; and 

converting said intermediate document file format docu- 
ment to the second document in the second document 
file format using said intermediate document file format 
blocks.

2. The computer implemented method of claim 1, wherein 
said grouping comprises: 

locating tags in the first document; and 
utilizing the tags in locating words, joining words into 
lines, joining lines into paragraphs, and locating tables.

3. The computer implemented method of claim 1, wherein 
each intermediate format block is selected from the group 
consisting of a word, a line, a paragraph, a table, and an 
image. 

4. The computer implemented method of claim 1, wherein
each of the first format and second format is selected from the
group consisting of portable document format (PDF), rich 
text format (RTF), hypertext markup language (HTML), 
extensible markup language (XML), cascading style sheets 
(CSS), Netscape Layers, linked and separate pages, Tag
Image File Format (TIFF), graphics interchange format 
(GIF), bit map (BMP), Joint Photographic Experts Group 
(JPEG), MICROSOFT WORD™, WORD PERFECT™, 
AUTOCAD™, and POWER POINT™. 

5. The computer implemented method of claim 1, wherein
the second format is selected from hypertext markup lan- 
guage (HTML) and rich text format (RTF), comprising: 

determining coordinates of each intermediate format 
block; 

generating a second format block for each intermediate
format block; 

generating a second format style sheet for each interme- 
diate format block, coordinates of each second format 
style sheet match coordinates of corresponding
intermediate format block;

mapping an intermediate format block font to second 
format font to fit second format block into second 
format style sheet; and 

placing each second format block into corresponding
second format style sheet. 

6. The computer implemented method of claim 1, wherein 
the second format is an image bitmap format, comprising: 

generating bitmap of the intermediate format document
using intermediate format blocks; and

placing the bitmap into second image document.

7. The computer implemented method of claim 1, wherein 
the first document is received over a network and the second 
document is sent over the network. 

8. The computer implemented method of claim 7, wherein
the network is selected from the group consisting of Internet 
and an intranet. 
9. The computer implemented method of claim 8, wherein
the receiving and the sending is via electronic mail. 

10. The computer implemented method of claim 7, further 
comprising 

locating headings of the first document; 

generating a table of contents page containing the head- 
ings in the second format, each table of contents 
heading containing a link to the heading contained in 
the document; and 

placing the table of contents page into the second docu- 
ment. 

11. The computer implemented method of claim 7, 
wherein said converting said intermediate format document 
to the second format document is selected from the group 
consisting of: 

converting to the second format document in one second 
format; 

converting to the second format document in multiple
second formats; and

converting to the multiple second format documents, each
in a different second format.

12. The computer implemented method of claim 11, 
further comprising: 

generating a computer executable program for selecting 
one second format to be displayed; and 

inserting the computer executable program into the sec- 
ond document. 

13. The computer implemented method of claim 12, 
wherein the computer executable program is written in a 
programming language selected from the group consisting 
of JAVA, Common Gateway Interface (CGI), Visual Basic,
Practical extraction and reporting language (Perl), C, and 
C++. 

14. A computer implemented method of converting a first 
document in a first format to a second document in a 
different, second format in hypertext markup language 
(HTML), comprising: 

locating data in the first document; 

grouping data into at least one intermediate format block 
in an intermediate format document; 

converting said intermediate format document to the 
second HTML document using said intermediate for- 
mat blocks; 

generating a table of coordinates wherein at least a subset 
of said coordinates correspond to a coordinate of each 
intermediate format block; and 

placing each intermediate format block on the corre- 
sponding coordinate in the table of coordinates. 

15. The computer implemented method of claim 7, 
wherein said generating the table of coordinates comprises: 

determining gaps extending across the intermediate for- 
mat document; 

creating a macro table having cells corresponding to 
portions of the intermediate format document outside 
of said gaps; and 

recursively dividing each cell of the macro table by 
determining gaps extending across the cell until each 
cell cannot be further divided. 
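By way of illustration only, the recursive division recited above resembles what is commonly called a recursive X-Y cut and may be sketched as follows. The representation of blocks as `(x0, y0, x1, y1)` tuples, the nested return structure, and the function names are assumptions of this sketch and do not limit the claim.

```python
def _split(boxes, lo, hi):
    """Group boxes into runs separated by gaps along one axis.

    `lo` and `hi` are the tuple indices of the start and end coordinate.
    """
    groups, end = [], None
    for b in sorted(boxes, key=lambda b: b[lo]):
        if end is None or b[lo] > end:   # a gap extends across the region here
            groups.append([b])
            end = b[hi]
        else:
            groups[-1].append(b)
            end = max(end, b[hi])
    return groups


def xy_cut(boxes):
    """Recursively divide a region until no cell can be further divided.

    A leaf is a list of boxes; a node is ("rows" | "cols", [children]).
    """
    for axis, lo, hi in (("rows", 1, 3), ("cols", 0, 2)):
        groups = _split(boxes, lo, hi)
        if len(groups) > 1:
            return (axis, [xy_cut(g) for g in groups])
    return boxes
```

For a full-width block above two side-by-side blocks, the sketch first divides the region into two rows and then divides the lower row into two columns.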

16. A computer program product for converting a docu- 
ment in a first document file format to a document in a 
second document file format different from the first docu- 
ment file format, comprising: 
computer code that locates first document file format data
in the first document; 

computer code that groups said first document file format 
data into at least one intermediate document file format 
block in an intermediate document file format
document, said computer code that groups includes
computer code that locates words in the first document, 
joins words into lines, and joins lines into paragraphs, 
each paragraph being one of said intermediate format 
blocks;

computer code that locates tables, each table being one of 
said intermediate format blocks; 

computer code that converts said intermediate document 
file format document to the second document in the 
second document file format using said intermediate 
document file format blocks; and 

a computer readable medium that stores the computer 
codes. 

17. The computer program product of claim 16, wherein
the computer readable medium is selected from the group 
consisting of CD-ROM, zip disk, floppy disk, tape, flash 
memory, system memory, hard drive, and data signal 
embodied in a carrier wave. 

18. A computer implemented method for displaying a
document, comprising: 

receiving a document for display; 

automatically locating sub-page breaks in the received 
document; 

subdividing the received document into sub-pages using
the sub-page breaks;

locating blocks within each sub-page; and

sequentially displaying at least a portion of each block of
the sub-pages within display parameters of a display
configuration, including determining if each block can
be displayed within display parameters of the display
configuration and dividing a block not within display
parameters into portions to be within the display
parameters of the display configuration, said dividing a
block including:

determining if the block is a table;

if the block is not a table, sequentially displaying each
element of the block until all elements of the block are
displayed;

if the block is a table:

determining the headings of the table and subset of
non-heading columns of the table displayable
within the display parameters;

display the subset of non-heading columns of all
rows of the table; and

continue determining a next subset of non-heading
columns of the table displayable within the display
parameters and displaying those columns of
all rows of the table until all rows and all columns
of the table have been displayed.

19. The computer implemented method for displaying a 
document of claim 18, wherein the document is in a markup 
language format. 

20. The computer implemented method for displaying a 
document of claim 18, further comprising: 



locating headings of the document; 

generating a table of contents page containing the 
headings, each table of contents heading containing a 
link to the heading contained in the document; and 

placing the table of contents page into the second docu- 
ment. 

21. A computer program product for maintaining a reposi- 
tory of first documents in at least one storage document file 
format, comprising: 

computer code that receives at least one first document, 
said at least one first document being in at least one first 
document file format; 

computer code that converts the first documents in the at 
least one first document file format to storage docu- 
ments in the at least one storage document file format, 
said storage document file format containing storage 
format blocks, said computer code that converts includes
computer code that locates words in the first 
documents, joins words into lines, joins lines into 
paragraphs, each paragraph being one of said storage 
format blocks, and locates tables, each table being one 
of said intermediate format blocks; and 

a computer readable medium that stores the computer 
codes. 
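
The word, line, and paragraph joining recited in claim 21 could be approximated as below. This is a sketch under stated assumptions: words arrive as `(text, x, y)` tuples with `y` increasing down the page, and the tolerances `y_tol` and `max_gap` are illustrative values, not figures taken from the claims.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, x, y), y increasing down the page


def join_words_into_lines(words: List[Word],
                          y_tol: float = 2.0) -> List[Tuple[float, str]]:
    """Group words whose y positions are within y_tol of a line's first word."""
    lines: List[Tuple[float, List[Word]]] = []
    for w in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(w[2] - lines[-1][0]) <= y_tol:
            lines[-1][1].append(w)
        else:
            lines.append((w[2], [w]))
    # Within each line, order the words left to right before joining.
    return [(y, " ".join(t for t, _, _ in sorted(ws, key=lambda w: w[1])))
            for y, ws in lines]


def join_lines_into_paragraphs(lines: List[Tuple[float, str]],
                               max_gap: float = 14.0) -> List[str]:
    """Merge consecutive lines whose vertical gap is at most max_gap."""
    paragraphs: List[str] = []
    prev_y = None
    for y, text in lines:
        if prev_y is not None and y - prev_y <= max_gap:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        prev_y = y
    return paragraphs
```

Each string returned by `join_lines_into_paragraphs` would correspond to one paragraph block in the storage format; table detection is left out of this sketch.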

22. The computer program product of claim 21, wherein 
the computer readable medium is selected from the group 
consisting of CD-ROM, zip disk, floppy disk, tape, flash 
memory, system memory, hard drive, and data signal 
embodied in a carrier wave. 

23. The computer program product of claim 21, further 
comprising computer code that converts the storage docu- 
ments to a display document. 

24. The computer program product of claim 21, further 
comprising 

computer code that locates keywords in the first docu- 
ments; and 

computer code that generates an index document of the 
located keywords, the index document containing the 
keywords, each keyword containing at least one link to 
the keyword contained in at least one first document. 
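
One possible reading of the index generation in claim 24 is sketched here. The function name `build_index`, the whole-word matching rule, and the `name#keyword` link form are assumptions for illustration; the claim only requires that each keyword carry at least one link into a first document.

```python
import re
from collections import defaultdict
from typing import Dict, List


def build_index(documents: Dict[str, str],
                keywords: List[str]) -> Dict[str, List[str]]:
    """Map each keyword found to links into the documents that contain it."""
    index = defaultdict(list)
    for name, text in documents.items():
        # Case-insensitive whole-word matching on alphanumeric tokens.
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        for kw in keywords:
            if kw.lower() in tokens:
                # Hypothetical link form: document name plus a keyword anchor.
                index[kw].append(f"{name}#{kw.lower()}")
    return {kw: sorted(links) for kw, links in sorted(index.items())}
```

Keywords that appear in no document are omitted from the result, so every entry in the index has at least one link.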

25. The computer program product of claim 21, further 
comprising: 

computer code that generates a computer executable 
program for selecting one second format; and 

computer code that inserts the computer executable pro- 
gram into the second document. 

26. The computer program product of claim 21, further 
comprising: 

computer code that locates headings of the first docu- 
ments; 

computer code that generates a table of contents page for 
each first document, the table of contents page con- 
taining the headings, each table of contents heading 
containing a link to the heading contained in the first 
document; and 

computer code that places the table of contents page into 
the second document. 